Random Forest Analysis of Heart Failure Dataset

Tim Leschke, Pamela Mishaw, Sierra Landacre, Pallak Singh

Random Forest

  • Random Forest is a promising machine learning model because it classifies large data sets accurately, is resistant to outliers, and is easy to use.
  • Random Forest is a group of decision trees (a forest) that are created from identically distributed, independent random samples of data drawn with replacement from the original dataset (Breiman 2001).

Methods

  • Figure 1 shows the generation of a Random Forest. Because each tree is grown from a different random sample, the trees make partly uncorrelated errors, which lets the ensemble reduce misclassification and improve classification accuracy.

The Gini Index is referred to as a measure of node purity (James et al. 2021). It can also be used to measure the importance of each predictor. The Gini Index is defined by the following formula, where K is the number of classes and \({\hat{p}_{mk}}\) is the proportion of observations in the mth region that are from the kth class. A Gini Index of 0 represents perfect purity.

\[G=\sum_{k=1}^{K} \hat{p}_{mk}\left(1-\hat{p}_{mk}\right)\]
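For a two-class node, the index reaches its maximum of 0.5 at a 50/50 split and falls to 0 when the node is pure. The poster's analysis was done in R; as a minimal illustration, the formula can be sketched in Python:

```python
def gini_index(proportions):
    """Gini index for one node: sum of p*(1 - p) over the K classes.
    0 means a perfectly pure node."""
    return sum(p * (1 - p) for p in proportions)

# A pure node (all observations in one class) has index 0;
# a two-class 50/50 split has the maximum index of 0.5.
print(gini_index([1.0, 0.0]))   # 0.0
print(gini_index([0.5, 0.5]))   # 0.5
```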

Bagging is the aggregation of the results from each decision tree. It is defined by the following formula, where B is the number of bootstrap training sets and \(\hat{f}^{*b}\) is the model fit to the bth set. Although bagging improves prediction accuracy, it makes interpreting the results harder, as they cannot be visualized as easily as a single decision tree (James et al. 2021).

\[\hat{f}_{\mathrm{bag}}(x)=\frac{1}{B}\sum_{b=1}^{B}\hat{f}^{*b}(x)\]
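The averaging in the formula is the regression form of bagging; for classification, the trees vote and the majority class wins. A toy Python sketch (the three stand-in "models" are hypothetical, not the poster's R code):

```python
def bagged_predict(models, x):
    """Average the B individual predictions f^{*b}(x), as in the
    bagging formula; classification would take a majority vote instead."""
    return sum(m(x) for m in models) / len(models)

# Three toy models, standing in for trees fit on different bootstrap samples:
models = [lambda x: x + 1, lambda x: x - 1, lambda x: x]
print(bagged_predict(models, 10.0))  # 10.0
```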

Data Used

  • The dataset comes from the Faisalabad Institute of Cardiology and the Allied Hospital in Faisalabad, Pakistan, and consists of the medical records of 299 patients who experienced heart failure (Chicco and Jurman 2020).
  • There are 299 patient records with 13 features per record.
  • We use this dataset to create and evaluate a Random Forest model used for predicting heart failure events for patients.

Model Evaluation Metrics

  • The Out-of-Bag data is the data left unused by an individual predictor (decision tree) after a bootstrap sample is taken from the original data set.
  • The OOB data is used for internal validation: the prediction error rate is computed by running the “unseen” data through each tree and then averaged to give the overall OOB error rate (Breiman 2001).
  • Confusion Matrix: Shows the values of True Negative, False Negative, False Positive, and True Positive predictions produced by a model.
  • Precision: This metric quantifies how many of the observations labeled as positive are actually positive (Raschka, Liu, and Mirjalili 2022).
  • Recall: It quantifies how many of the positive observations have been predicted as positive (Raschka, Liu, and Mirjalili 2022).
  • F1 Score: The harmonic mean of precision and recall, used to assess predictive performance (Raschka, Liu, and Mirjalili 2022).
  • Balanced Accuracy: Aids in reducing the effects of class imbalance in the data set so that accuracy of predicting the dominant class does not conceal the classification accuracy for the minority class (Brodersen et al. 2010).
  • AUC-ROC: The area under the receiver operating characteristic curve. The ROC curve plots the model’s true positive rate (Y-axis) against its false positive rate (X-axis) at varying classification thresholds (Huang and Ling 2005).
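As an illustration, precision, recall, F1, and balanced accuracy can all be computed directly from the four confusion-matrix counts. A Python sketch with hypothetical counts (the poster's analysis itself was done in R):

```python
def metrics_from_confusion(tp, fp, fn, tn):
    """Precision, recall, F1, and balanced accuracy from the
    confusion-matrix counts described above."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # true positive rate (sensitivity)
    specificity = tn / (tn + fp)     # true negative rate
    f1 = 2 * precision * recall / (precision + recall)
    balanced_accuracy = (recall + specificity) / 2
    return precision, recall, f1, balanced_accuracy

# Hypothetical counts, not results from the heart failure models:
p, r, f1, ba = metrics_from_confusion(tp=40, fp=10, fn=20, tn=30)
print(round(p, 2), round(r, 2), round(f1, 2), round(ba, 2))  # 0.8 0.67 0.73 0.71
```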

Model Results

  • The testing accuracies of both models are slightly lower than their respective training accuracies, which suggests mild overfitting of the training dataset.
  • Model 2 performs better than model 1: its training and testing accuracies, precision, F1 score, and balanced accuracy are all higher than those of the default model.

References

Methods

  • We assume the reader is familiar with single classification trees.

The Random Forest

  • Random Forest uses multiple classification trees.

Random Forest Algorithm

The Gini Index

Bagging

Analysis and Results

  • We ingest the data into RStudio
  • Perform classification with Random Forest
  • Perform analysis
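The classification step above was done in R with the Random Forest package; as a language-neutral illustration of the same idea, here is a deliberately tiny pure-Python sketch that uses one-feature decision stumps in place of full trees and toy data in place of the patient records:

```python
import random

def bootstrap(data, rng):
    """Draw a sample of the same size with replacement, redrawing
    until both classes are present (keeps the toy stump well defined)."""
    while True:
        sample = [rng.choice(data) for _ in data]
        if len({y for _, y in sample}) == 2:
            return sample

def fit_stump(sample):
    """A one-split stand-in for a decision tree: threshold midway
    between the class means of a single feature."""
    pos = [x for x, y in sample if y == 1]
    neg = [x for x, y in sample if y == 0]
    t = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x > t else 0

def random_forest(data, n_trees=25, seed=0):
    """Grow each stump on its own bootstrap sample and classify by
    majority vote, mirroring the ensemble workflow above."""
    rng = random.Random(seed)
    trees = [fit_stump(bootstrap(data, rng)) for _ in range(n_trees)]
    return lambda x: 1 if sum(t(x) for t in trees) * 2 > n_trees else 0

# Toy one-feature data (hypothetical), not the real patient records:
data = [(1.0, 0), (1.2, 0), (0.8, 0), (5.0, 1), (5.2, 1), (4.8, 1)]
predict = random_forest(data)
print(predict(0.9), predict(5.1))  # 0 1
```

A real analysis would of course use full trees over all 13 features and random feature subsets at each split; the sketch only shows the bootstrap-then-vote structure.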

Heart Failure Dataset

  • Dataset from Faisalabad Institute of Cardiology

  • 299 patient records

  • 13 features per record

  • Random Forest used to classify patients for heart failure

Dataset Features

Software and Hardware Configuration

  • RStudio Pro 2023.12.0, Build 369.pro3

  • Various R libraries

  • RStudio Server running on an RHEL 9-based virtual machine within a VMware vSphere HA cluster. The VM has 50 vCPUs and 196 GB of RAM assigned

  • Hardware includes Dell PowerEdge R750 servers with dual Xeon Gold 6338N (32-core) CPUs, 512 GB of RAM, and SFP28 25 Gbit networking for all communications

  • Cluster storage is provided by an NVMe-based Dell SAN.

Model Evaluation Metrics

Various metrics are used to assess the Random Forest model performance:

  • Out-of-Bag (OOB) error rate/accuracy - OOB data is the data left out of each decision tree's bootstrap sample.
  • Confusion Matrix - shows True Positive, False Positive, False Negative and True Negative values to support performance evaluation
  • Precision - also known as positive predictive value; it represents how many observations labeled positive are actually positive.
  • Recall - also known as sensitivity; quantifies how many positive observations are actually predicted as positive.
  • F1 - harmonic mean of the precision and recall; assesses predictive performance.
  • Balanced accuracy - the average of the true-positive rate and the true-negative rate (sensitivity and specificity).
  • AUC-ROC - area under the curve created by plotting the true positive rate vs. the false positive rate.
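AUC-ROC has a useful probabilistic reading: it equals the probability that a randomly chosen positive case is scored above a randomly chosen negative case. A Python sketch of that ranking form, with hypothetical scores (the actual ROC curves in this work were produced in R):

```python
def roc_auc(scores, labels):
    """AUC by ranking: fraction of positive/negative pairs where the
    positive scores higher (ties count one half). Equivalent to the
    trapezoidal area under the TPR-vs-FPR curve."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical model scores; higher should mean "heart failure event":
print(roc_auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))  # 1.0 (perfect ranking)
print(roc_auc([0.9, 0.2, 0.8, 0.3], [1, 0, 0, 1]))  # 0.75
```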

Variable Importance Plot

FourFold Plot (Confusion Matrix)

Confusion Matrix Heatmap

Variable Correlation Heatmap

ROC - Default Values

trainControl() - Random Selection

Variable Importance Plot - Tuned

FourFold Plot (Confusion Matrix)

Kappa Plot

Tuned vs. Default ROC

Chicco, Davide, and Giuseppe Jurman. 2020. “Machine Learning Can Predict Survival of Patients with Heart Failure from Serum Creatinine and Ejection Fraction Alone.” BMC Medical Informatics and Decision Making 20: 1–16.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. An Introduction to Statistical Learning, 2nd ed. New York: Springer.